Assessment of a non-invasive approach to pregnancy diagnosis in gray whales through drone-based photogrammetry and faecal hormone analysis

Knowledge of baleen whales' reproductive physiology is limited and requires long-term individual-based studies and innovative tools. We used 6 years of individual-level data on the Pacific Coast Feeding Group gray whales to evaluate the utility of faecal progesterone immunoassays and drone-based photogrammetry for pregnancy diagnosis. We explored the variability in faecal progesterone metabolites and body morphology relative to observed reproductive status and estimated the pregnancy probability for mature females of unknown reproductive status using normal mixture models. Individual females had higher faecal progesterone concentrations when pregnant than when presumed non-pregnant. Yet, at the population level, high overlap and variability in progesterone metabolite concentrations occurred between pregnant and non-pregnant groups, limiting the use of this metric for accurate pregnancy diagnosis in gray whales. By contrast, body width at 50% of the total body length (W50) correctly discriminated pregnant from non-pregnant females at individual and population levels, with high accuracy. Application of the model using the W50 metric to mature females of unknown pregnancy status identified eight additional pregnancies with high confidence. Our findings highlight the utility of drone-based photogrammetry to non-invasively diagnose pregnancy in this group of gray whales, and the potential for improved data on reproductive rates for population management of baleen whales generally.

mixture models to the data derived from these non-invasive techniques. Given the inherent challenges of physiological studies of baleen whales, non-invasive diagnosis of pregnancy could significantly advance conservation management efforts of these threatened and protected animals through assessment of reproductive effort, success and loss.

Sampling location and field methods
We conducted sampling efforts from a small rigid-hulled inflatable boat (5.4 m) during the PCFG foraging seasons (late May to mid-October) along the central Oregon coast, USA (off Newport, 44°38′13″ N, 124°03′08″ W) annually from 2016 to 2021. Once a gray whale or whale group was located, we photographed individuals for identification purposes and conducted drone flights for photogrammetry analysis (details below) as weather conditions allowed. We opportunistically collected faecal samples at whale sightings using two dipnets outfitted with 300 µm nylon mesh; samples were immediately transferred to sterile plastic jars, placed on ice, and then stored in a freezer (−20°C) upon returning from the field within 2-6 h after collection. By contrast to faecal samples from some other species (e.g. North Atlantic right whale), faeces of gray whales in this study area consist of small particles that diffuse and sink quickly in the water column; thus, the amount of faecal material collected varied among samples depending on not just defecation mass, but also on environmental conditions and speed of sampling (e.g. how quickly dipnet collection began after defecation), which itself was contingent upon avoiding the whale's path. We performed hormone analysis within 11 months of collection (details below). We documented date, time and location for each faecal sample and linked these data to specific individuals via photo-identification matching [56,57].

Drone-based photogrammetry
We collected aerial videos of PCFG gray whales using drones (electronic supplementary material, table S1). We recorded videos at a minimum altitude of 25 m. We did not observe any behavioural responses of whales to the UAS (i.e. no change of direction, sudden dive, increased swimming speed, etc.). We extracted snapshots of individual whales from the aerial videos using VLC Media Player (v. 3.0.16; VideoLAN, Paris, France) for photogrammetry analysis. We measured the total body length (TL, measured as snout to fluke notch) and body width, in 5% increments between 20% and 70% of the whale's TL, using MorphoMetriX [58] and then processed the measurements using CollatriX [59]. Following our published methodology for this species, we then standardized the body widths by TL, which produces a scale-invariant and unitless metric that allows comparison across individuals with high precision [39]. All UAS are susceptible to photogrammetric uncertainty associated with the altimeter, camera, focal length and pixel measurement [36]. To incorporate this uncertainty associated with each UAS, we applied Bayesian methods to generate a posterior predictive distribution for each morphological measurement [36,39]. We used measurements of a 1.0 m wooden board floating at the surface in images collected between 20 and 70 m altitude as our calibration object and training data for the Bayesian statistical model [36].

Individual photo-identification: age, sex, and reproductive state
We used Adobe Bridge (v. 8.0.1.282) to assess whale identification photographs, using only high-quality images that were in focus and not affected by glare, angle or distance [56,57]. Sex was determined based on (i) observation (i.e. mother with a calf), (ii) previous genetic analysis of tissue samples for individuals identified from the photo-ID catalogue [60], or (iii) faecal sample genetic analysis [35]. We estimated the ages of individuals based on the length of their sighting history (LSH) from the photo-identification catalogue. Individuals first observed as a calf were considered to have a 'known age' equal to the LSH, whereas individuals not first observed as a calf were considered to have a 'minimum age' equal to the LSH.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 230452
For this study, we investigated known female whales (n = 51 individuals) assigned into one of four reproductive classes: juvenile female (JF), mature female (MF), pregnant female (PF) and lactating female (LF). Individuals were classified as 'mature' (i.e. MF) if their known age or minimum age was greater than or equal to 8 years, which is the mean age of sexual maturity for gray whales based on histological examinations of gonads and lamina of earplugs [7,61]; individuals with a known age less than 8 years were classified as 'juveniles' (i.e. JF). Individuals with an unknown age (no sighting history) or a minimum age less than 8 years were classified as mature if greater than 50% of the posterior predictive distribution of their estimated TL (see above) was greater than 11.7 m, which is the average length at maturity for female gray whales [7,36], and as juveniles if 50% or less of that distribution was greater than 11.7 m. Females sighted with a calf at the time of sample collection were classified as LF, and whales observed with a calf the year after sample collection were presumed to be pregnant at the time of faecal collection and classified as PF. We considered all immature and lactating females as presumed non-pregnant (the lactation period lasts approx. seven months, and there are no known cases of lactating gray whales producing a new calf the next year). We classified as MF any mature females of unknown pregnancy status; this group is assumed to include a mixture of pregnant and non-pregnant whales. Therefore, we did not include the MF group in the initial analyses to develop pregnancy diagnostic statistical methods.
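The maturity rules above can be summarised as a small decision function. The following is only an illustrative sketch in Python (the study's analyses were run in R); the function name, argument names and the 'unknown' fallback are assumptions introduced here, while the 8-year and 11.7 m thresholds are those quoted in the text.

```python
import numpy as np

MATURITY_AGE_YEARS = 8     # mean age at sexual maturity quoted in the text
MATURITY_LENGTH_M = 11.7   # average length at maturity quoted in the text


def classify_maturity(age_years, age_is_known, tl_posterior_m=None):
    """Return 'mature', 'juvenile' or 'unknown' for one female (hypothetical helper).

    age_years      -- known or minimum age from sighting history, or None
    age_is_known   -- True if the whale was first sighted as a calf
    tl_posterior_m -- draws from the posterior predictive distribution of TL (m)
    """
    # Known or minimum age at or above the threshold is sufficient for maturity.
    if age_years is not None and age_years >= MATURITY_AGE_YEARS:
        return "mature"
    # A known age below the threshold means the whale is a juvenile.
    if age_years is not None and age_is_known:
        return "juvenile"
    # Otherwise fall back on length: mature if more than half of the
    # posterior TL draws exceed the length at maturity.
    if tl_posterior_m is None:
        return "unknown"  # assumption: no age and no length information
    frac_above = float(np.mean(np.asarray(tl_posterior_m) > MATURITY_LENGTH_M))
    return "mature" if frac_above > 0.5 else "juvenile"
```

For example, a whale with a minimum age of 3 years whose posterior TL draws mostly exceed 11.7 m would be returned as 'mature' under this sketch.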

Faecal hormone metabolites
Faecal samples from 2016 to 2018 were previously analysed by Lemos et al. [22]. In subsequent years (2019-2021), we followed the same protocols for hormone extraction and quantification. Briefly, we filtered, desalted (to remove spurious inflation of dried faecal mass by salt crystals) and freeze-dried the faecal samples. We weighed the dried and homogenized samples to the nearest 0.1 mg, and excluded samples below 0.02 g from the analysis to avoid inflated values ('small sample effect'; see [4,62]). Faecal samples contain the metabolized breakdown products of progesterone; we used a progesterone assay kit whose antibody was originally raised against progesterone, but that also cross-reacts to 5α-reduced progestin metabolites (#ADI-900-011), and thus our metric quantifies a subset of faecal progestin metabolites (fP4m). Specifically, the manufacturer reports cross-reactivities of 100% to 5α-pregnane-3,20-dione, 3.46% to 17-OH-progesterone, 1.43% to 5-pregnen-3β-ol-20-one, and less than 1% for all other tested steroids (https://www.enzolifesciences.com). We extracted the fP4m from the aliquoted faecal sample with 90% methanol (Methanol HPLC grade, Fisher Chemical), maintaining the sample mass to solvent volume ratio within a range of 1 : 10 to 1 : 25 and vortexing at room temperature for 30 min at 500 r.p.m. Then we centrifuged the mixture (sample and methanol) at 2200 r.p.m. for 20 min to separate the pellet from the supernatant with extracted hormones. We dried down the supernatant under vacuum and then we reconstituted the extracted hormones in deionized water with sonication and vortexing (1 : 1 dilution). We quantified the fP4m using a commercial Enzyme-linked Immunosorbent Assay kit for progesterone (#ADI-900-011) from Enzo Life Sciences, following the manufacturer's protocols (https://www.enzolifesciences.com).
Finally, we converted the raw data (pg ml⁻¹) to ng of hormone per g of dried faeces, correcting by the volume of extraction and the dilution factor used when applicable. This kit has been successfully used for pregnancy diagnosis using blubber samples of odontocetes [63]; blubber, serum and urine of bowhead whale (Balaena mysticetus) [64]; blubber of blue whale [65] and faeces of some terrestrial artiodactyls [66]. For quality assurance and quality control, we ran all samples in duplicate, including a full standard curve (i.e. six standards with a concentration range from 500 to 15.62 pg ml⁻¹), and an internal control (i.e. a progesterone standard of known concentration) in each assay. We reran any samples with greater than 15% coefficient of variation (CV) between replicates, and, if the sample fell outside of the percent-bound range of 15-98%, we adjusted the dilution accordingly and reanalysed the sample. For the values below the limit of detection (less than LOD), we assigned a concentration of half the LOD reported by the manufacturer (i.e. LOD = 8.57 pg ml⁻¹ according to the information reported by the manufacturer for the Progesterone EIA kit #ADI-900-011, https://www.enzolifesciences.com). When we collected more than one sample from one individual on the same day, we combined the samples into a single jar prior to analysis, except for a few cases (n = 6), where the whale ID was not confirmed for the faecal sample while in the field. We analysed these six samples separately; once it was determined (from photo-identification) that they were duplicate samples (i.e. another sample had been collected from that whale on the same day), in the analyses we included only the sample with higher faecal mass. The progesterone assay kit used in this study has previously been validated for gray whale faecal samples with satisfactory parallelism and accuracy [22].
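The unit conversion and quality-control rules described above amount to simple arithmetic, sketched below in Python for illustration. The function and parameter names are hypothetical, and the exact form of the correction (concentration multiplied by dilution factor and reconstitution volume, divided by dried mass) is an assumption about how the 'volume of extraction and dilution factor' enter; only the LOD value and the 15% CV threshold come from the text.

```python
LOD_PG_ML = 8.57      # manufacturer-reported limit of detection quoted in the text
MAX_CV_PERCENT = 15   # rerun threshold for duplicate wells quoted in the text


def substitute_lod(conc_pg_ml, lod=LOD_PG_ML):
    """Replace values below the limit of detection with half the LOD."""
    return lod / 2 if conc_pg_ml < lod else conc_pg_ml


def to_ng_per_g(conc_pg_ml, recon_volume_ml, dilution_factor, dry_mass_g):
    """Convert an assay reading (pg/ml) to ng hormone per g dried faeces.

    Assumed form: total pg in the extract = reading x dilution x volume.
    """
    total_pg = conc_pg_ml * dilution_factor * recon_volume_ml
    return total_pg / 1000.0 / dry_mass_g  # 1000 pg = 1 ng


def duplicate_cv_percent(a, b):
    """Coefficient of variation (%) between two replicate readings."""
    mean = (a + b) / 2
    sd = abs(a - b) / 2 ** 0.5  # sample standard deviation of two values
    return 100 * sd / mean


def needs_rerun(a, b):
    """True if the duplicate readings exceed the CV threshold."""
    return duplicate_cv_percent(a, b) > MAX_CV_PERCENT
```

Under these assumptions, a reading of 200 pg ml⁻¹ at a twofold dilution in 1 ml of reconstituted extract from 0.05 g of dried faeces corresponds to 8 ng g⁻¹.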

Data analysis
We sought to determine whether the faecal progesterone and photogrammetric techniques were viable tools for non-invasive diagnosis of pregnancy in gray whales. To this end, we explored the data with univariate and multivariate normal mixture models to assign probabilities of pregnancy for females with unknown reproductive status. Based on the evidence that gray whales conceive during the southbound migration [7,53], we assume that all faecal samples and morphometric measurements from pregnant females were collected at around six to nine months of gestation (gestation length approx. 13 months [7,53]). When multiple faecal samples were obtained from an individual in a given year, we reported and analysed the median apparent fP4m concentration for that year under the assumption that the median fP4m values would more accurately reflect the reproductive condition of each individual in a given season [31]. Similarly, when multiple photogrammetric measurements were collected from an individual in a given year, we analysed the maximum widths that an individual reached each year under the assumption that pregnant females will become wider over time. Furthermore, we only used photogrammetry measurements taken in the late season (after 26th August) each year. This late season cut-off was chosen to avoid including individuals in the early season that have recently arrived from their wintering lagoons after fasting, while still capturing most of the measurements available from the known pregnant females in our dataset. Only one PF (Er-0333, observed on 7 July 2021) was excluded using this cut-off. Nine other faecal samples were excluded from the analyses because either we could not reliably match the sample with an individual whale (n = 3), or because the sample came from an unknown individual with unknown sex or unknown maturity status (n = 6).
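The per-individual, per-year aggregation described above (median fP4m; maximum late-season W50) can be sketched as follows. This is an illustrative Python outline, not the code used in the study; the record layout (whale ID, date, value) and the function names are assumptions.

```python
from collections import defaultdict
from datetime import date
from statistics import median

LATE_SEASON_START = (8, 26)  # (month, day): only measurements after 26 August are kept


def yearly_median_fp4m(samples):
    """samples: iterable of (whale_id, date, fp4m). Returns {(id, year): median}."""
    grouped = defaultdict(list)
    for whale_id, day, value in samples:
        grouped[(whale_id, day.year)].append(value)
    return {key: median(values) for key, values in grouped.items()}


def yearly_max_w50(measurements):
    """measurements: iterable of (whale_id, date, w50). Returns {(id, year): max}.

    Measurements taken on or before the late-season cut-off are dropped.
    """
    best = {}
    for whale_id, day, w50 in measurements:
        if (day.month, day.day) <= LATE_SEASON_START:
            continue
        key = (whale_id, day.year)
        best[key] = max(w50, best.get(key, w50))
    return best
```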
Our final dataset consisted of 76 fP4m observations (including PF = 5, LF = 4, MF = 48 and JF = 19), and 77 morphometric measurements (including PF = 5, LF = 3, MF = 52 and JF = 17). In 42 cases, we obtained both fP4m and morphometric measurements from the same individual in the same year (including PF = 5, LF = 2, MF = 25 and JF = 10; electronic supplementary material, table S2). fP4m values were log-transformed to ensure that they were normally distributed, and normality was assessed visually with a normal Q-Q plot of the residuals (electronic supplementary material, figure S2).
As part of exploratory data analysis, we investigated how fP4m and the morphometric variables were distributed according to presumed pregnancy status (i.e. pregnant females (PF) versus presumed non-pregnant females (JF and LF)) and performed a one-way ANOVA, with a post hoc Tukey honestly significant difference (HSD) test for multiple comparisons (see electronic supplementary material, figures S1 and S2 for tests of model assumptions). We visually determined that the width at 50% of the total length (W50) was the morphometric variable that provided the best separation between the two groups, with the least overlap and dispersion (see Results, figure 3). The W50 also falls around the maximum body width of a gray whale's profile, which has been proposed as a good variable for recognizing near-term pregnant gray whale females along their southbound migration [42]; therefore, only this morphometric variable was included in subsequent analyses.
We used Monte Carlo methods to propagate photogrammetric uncertainty by averaging the results of 80 000 replications of an ANOVA comparing W50 by female demographic unit (JF, LF, PF) [37]. For each replicate, we sampled each whale's W50 from a normal distribution parametrized with the posterior mean and variance from that whale's posterior distribution [44] and calculated the difference between the coefficients for W50 for each demographic unit. We then calculated the mean and highest posterior density intervals (HPDI) for the difference between each demographic unit.
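The Monte Carlo propagation described above can be illustrated with a short Python sketch. For a one-way ANOVA, the difference between two group coefficients equals the difference between the two group means, so the sketch computes group-mean differences directly. It is a simplified stand-in rather than the study's code: the function name and arguments are hypothetical, and a 95% percentile interval is used in place of the highest posterior density interval.

```python
import numpy as np


def mc_group_difference(post_mean, post_sd, groups, a, b, n_rep=80_000, seed=0):
    """Propagate measurement uncertainty into the difference in mean W50 (a minus b).

    post_mean, post_sd -- posterior mean and standard deviation of W50 per whale
    groups             -- demographic unit label per whale (e.g. 'PF', 'LF', 'JF')
    Returns the mean difference and a 95% percentile interval (an assumption;
    the text reports HPDIs).
    """
    rng = np.random.default_rng(seed)
    post_mean = np.asarray(post_mean, dtype=float)
    post_sd = np.asarray(post_sd, dtype=float)
    groups = np.asarray(groups)
    # One row per replicate: each whale's W50 drawn from its own normal.
    draws = rng.normal(post_mean, post_sd, size=(n_rep, post_mean.size))
    diff = draws[:, groups == a].mean(axis=1) - draws[:, groups == b].mean(axis=1)
    lo, hi = np.percentile(diff, [2.5, 97.5])
    return float(diff.mean()), (float(lo), float(hi))
```

An interval that excludes zero would indicate that the two demographic units differ in W50 even after accounting for photogrammetric uncertainty.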
We used the EM algorithm to fit a multivariate normal mixture model (R package: mixtools) to the log-transformed fP4m and W50 data from all females (Model 1). We assumed that the mixture had two components: one component characterized by low fP4m and low W50 (presumed non-pregnant females) and one by high fP4m and high W50 (presumed pregnant females). The posterior probability of pregnancy for each combination of fP4m and W50 values in the data was then calculated as the ratio of the probability density for the component with higher means for the two variables to the sum of the two probability densities. We used a non-parametric bootstrap approach, similar to the one applied in Melica et al. [8], to quantify the uncertainty around each probability estimate. Specifically, we resampled the variables with replacement 10 000 times, fitted the mixture model to the bootstrapped dataset, and estimated the probabilities of pregnancy for all combinations of values in the data. Owing to the small number of data points from pregnant females, those records were included in all bootstrap samples [8]. We calculated the 95% confidence intervals using the 2.5th and 97.5th percentiles of the estimated probabilities of pregnancy. We also used the EM algorithm to fit univariate normal mixture models to each of the two variables separately. For fP4m (Model 2), we fitted a normal mixture model with two components, with the assumption that these would capture two groups of individuals characterized by either low (presumed non-pregnant) or high (presumed pregnant) fP4m. Similarly, for W50 (Model 3), we fitted a model with two components, assuming that the model would capture two groups of individuals characterized by either low (presumed non-pregnant) or high (presumed pregnant) W50.
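To make the univariate case concrete, the sketch below fits a two-component normal mixture with a hand-written EM loop and computes the probability of belonging to the higher-mean component, followed by a simple bootstrap. It is a minimal Python illustration, not the mixtools implementation used in the study: all function names are hypothetical, the component densities are weighted by the estimated mixing proportions (an assumption about how the 'probability densities' are defined), and the bootstrap keeps the known-pregnant records fixed while resampling the remaining records, which is one possible reading of the procedure described above.

```python
import numpy as np


def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))


def fit_two_component_em(x, n_iter=200):
    """Fit a two-component univariate normal mixture by EM."""
    x = np.asarray(x, dtype=float)
    xs = np.sort(x)
    half = xs.size // 2
    # Initialise the means from the lower and upper halves of the sorted data.
    mu = np.array([xs[:half].mean(), xs[half:].mean()])
    sd = np.array([x.std(), x.std()])
    lam = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        dens = lam * normal_pdf(x[:, None], mu, sd)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions, means and standard deviations.
        nk = resp.sum(axis=0)
        lam = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        sd = np.maximum(sd, 1e-6)  # guard against a collapsing component
    return lam, mu, sd


def pregnancy_probability(x, lam, mu, sd):
    """Weighted density of the higher-mean component over the sum of both."""
    x = np.asarray(x, dtype=float)
    dens = lam * normal_pdf(x[:, None], mu, sd)
    high = int(np.argmax(mu))
    return dens[:, high] / dens.sum(axis=1)


def bootstrap_probability(x, keep_idx, n_boot=200, seed=0):
    """Return 2.5th, 50th and 97.5th percentiles of the probability per record.

    keep_idx -- indices of records always included (e.g. known pregnant females)
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    keep_idx = np.asarray(keep_idx, dtype=int)
    others = np.setdiff1d(np.arange(x.size), keep_idx)
    probs = []
    for _ in range(n_boot):
        resampled = rng.choice(others, size=others.size, replace=True)
        idx = np.concatenate([keep_idx, resampled])
        lam, mu, sd = fit_two_component_em(x[idx], n_iter=100)
        probs.append(pregnancy_probability(x, lam, mu, sd))
    return np.percentile(np.array(probs), [2.5, 50, 97.5], axis=0)
```

The sketch omits the retention filter applied in the study (discarding non-convergent fits and fits in which the 'pregnant' component does not have the greater mean); for a univariate model the higher-mean component is selected by construction.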
For each model, we only retained bootstrap samples if the mixture model identified both variables as having a greater mean for the component assumed to correspond to pregnant females (e.g. high fP4m or high W50), and we discarded non-convergent models [8]. Under these conditions, only 65% of the bootstrap samples were retained for Model 1 and greater than 95% of the bootstrap samples were retained for the two univariate models (Models 2 and 3). We used the bootstrap procedure described above to calculate the probability of a female whale belonging to the component with high fP4m or high W50 (median and 95% confidence intervals). We compared the output of the three models in terms of their ability to classify whales of known reproductive status, considering that the model classified an individual as pregnant when the assigned probability was greater than 75% (model performance, table 3). Further, we applied the three models to assign a pregnancy probability to all MF and evaluated the agreement among these classifications. We conducted all statistical analyses in R, and tested mean comparisons between the PF and non-pregnant (LF and JF) groups with a significance level of 0.05.

Figure 2. Group mean comparisons for log-transformed faecal progesterone metabolite concentrations (in ng of immunoreactive hormone per g dried faeces) (a), and standardized width at 50% of total body length (b) between presumed pregnant (PF = females seen with a calf the year after sampling) and presumed non-pregnant females (LF = lactating females observed with a calf the year of sampling and JF = sexually immature females). The black horizontal lines represent the group mean; box encloses 50% of the data; whiskers enclose the smallest and largest values within 1.5 times interquartile range below and above the 25th and 75th percentile, respectively; individual values shown as circles, with uncertainty in morphometric measurements represented as dashed lines (95% highest posterior density intervals); n.s. denotes no statistical difference between the group means, while asterisks denote a significant difference based on the ANOVA test.

Table 1. Descriptive statistics of potential pregnancy indicators: mean (range) of the log of faecal metabolites of progestins (fP4m, in ng of immunoreactive hormone per g dried faeces) and standardized width at 50% of total body length (W50) by presumed reproductive status; PF: observed with a calf after the year of sampling, LF: observed with a calf the year of sampling, JF: immature female, MF: adult female not seen or not seen accompanied by a calf in the year after sample collection.

Table 2. Posterior probability of pregnancy P(p) for each combination of fP4m and W50 assigned to each female gray whale of known and unknown reproductive status via bootstrapping (N = 10 000). For each whale, P(p) was calculated as the ratio of the probability density for the component with higher means for the two variables to the sum of the two probability densities for each model, and the lower (LQ) and upper (UQ) quartiles are reported. Additional information included year, age, age type, log-transformed faecal progesterone metabolites (fP4m), standardized maximum width at 50% of total body length in the late season (W50) and sighting history (RS: resighted in the following year, calf: resighted with a calf, and date: earliest date seen in the following year). Individuals with a P(p) ≥ 70% are bolded.

figure S1). Based on these exploratory analyses, we used W50 in subsequent analyses.

Comparing mixture models' performances
Overall, the three different models produced large confidence intervals around the probability of pregnancy assigned to each individual, illustrating the challenges of defining two components that accurately diagnose pregnant and non-pregnant states (figure 4). Although the interquartile ranges of both fP4m and W50 indicate that extreme probability values are uncommon (table 2), they influence the models' performances, particularly due to the high overlap we observed in the range of fP4m for the two groups (Models 1 and 2). Model 3 (only using W50) was better able to separate the two groups and performed well at classifying individuals of known reproductive status, particularly the JF group (figure 4 and table 2).

Table 3. Model performance comparisons. The models are compared in their ability to correctly classify individuals of known reproductive status; for this purpose, we considered all females with a probability of pregnancy greater than or equal to 75% to be classified as pregnant. TP, true positive, number of individuals classified as pregnant when they were presumed to be pregnant; TN, true negative, number of individuals classified as not pregnant when presumed not pregnant; FN, false negative, number of individuals classified as not pregnant when presumed pregnant; FP, false positive, number of individuals classified as pregnant when presumed not pregnant; and total, total number of individuals used for evaluating the models' performances; PF, females seen with a calf the year after sampling; LF, females seen with a calf the year of sampling; JF, immature female.

The performance of the three models was assessed based on their ability to correctly classify whales of known reproductive status, i.e. presumed pregnant (PF) and presumed non-pregnant (LF and JF; table 3) females. Individual gray whales were predicted to belong to the pregnant group when the mean probability assigned by the model was equal to or higher than 75%.
Models 1 and 3 performed with a reasonably low misclassification rate (less than 25%). However, Model 2, in which fP4m was the only variable used to classify pregnancy, presented a higher misclassification rate (67%). The two models that include fP4m (Models 1 and 2) also exhibited significantly lower ability to correctly classify PF (true positive rate = 80%) compared with the model with only W50 (Model 3), in which all presumed pregnant females were correctly classified as pregnant. Hence, Model 3 had a 0% false negative rate. Model 3 also had the lowest false positive rate, with an 8% probability of misclassifying non-pregnant individuals as pregnant. Overall, Model 3, which applied only W50 in a two-component mixture model, had the highest accuracy (92%) and lowest misclassification rate (8%).
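The performance measures used above follow directly from the four confusion-matrix counts defined in table 3. The Python sketch below shows the arithmetic; the function names are hypothetical, and the counts in the usage note are made-up illustrative values, not those reported in table 3.

```python
PREGNANCY_THRESHOLD = 0.75  # probability cut-off used in the text


def classify_pregnant(probability, threshold=PREGNANCY_THRESHOLD):
    """Classify as pregnant when the assigned probability reaches the cut-off."""
    return probability >= threshold


def classification_metrics(tp, tn, fp, fn):
    """Summarise classifier performance from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / (tp + fn),
        "false_negative_rate": fn / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }
```

With illustrative counts of TP = 4, TN = 15, FP = 5 and FN = 1, this gives an accuracy of 0.76, a misclassification rate of 0.24, a true positive rate of 0.80 and a false positive rate of 0.25. Accuracy and misclassification rate always sum to one.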
Post hoc assessment of presumed non-pregnant individuals that were misclassified as pregnant (table 2) indicates that two immature females (JF; Er-0358 and Er-0377 in 2019) were classified as pregnant by all three models and two other JFs (Er-252 and Er-0318 in 2019) were classified as pregnant by two models (Models 1 and 2). Owing to incomplete sighting histories that limit our knowledge of these whales' true age and maturity status, and potential for pregnancy loss, it is challenging to determine the accuracy of these pregnancy classifications. However, one of the two JFs consistently classified as pregnant by all three models (Er-0377 in 2019) had a known age (age = 5 years) and was resighted in early February of the following year without a calf, whereas the other whale (Er-0358 in 2019) had a minimum age (min age = 3) and was also resighted the following year with no calf in late August. The high probability of pregnancy for these individuals was influenced primarily by elevated levels of fP4m, but also by moderately high values of W50 (table 2). Of the other two JFs classified with high probability of pregnancy by both Models 1 and 2, one had a minimum age of 5 (Er-0318 in 2019) and was resighted without a calf in the following year in early June, and the other (Er-252 in 2019) had a known age (age = 5) and was resighted with no calf in March of the following year. These individuals presented moderately elevated W50 and elevated fP4m (table 2). In addition, 13 other JFs were classified as pregnant only by Model 2 (table 2), but due to the low true positive rate of Model 2 (table 3) we deem these unreliable classifications. Lastly, one LF (Er-0014 in 2018) was classified as pregnant with high probability by Model 3, yet this individual was resighted with no calf the following year in July. The estimated high probability of pregnancy for this individual was influenced primarily by the whale's moderately high W50.
One additional LF was also classified as pregnant with high probability by Model 2 (Er-0019 in 2020) based on relatively high fP4m and was resighted the following year with no calf (table 2). However, given the high overlap observed in the ranges of fP4m in these two groups (PF and presumed non-pregnant JF and LF), these classifications may be unreliable.

Application of the models to assign probability of pregnancy to the mature females of unknown reproductive status
Mature females that were not observed with a calf the following year (MF), and from which we obtained fP4m and/or W50 data, presented fP4m concentrations and W50 measurements that fell both within and outside of the known-pregnant ranges. Two individuals (Er-0018 and Er-0323 in 2019, table 2) were consistently assigned high pregnancy probability by all three models. The multivariate model (Model 1) classified 6 out of 25 MFs as pregnant with high probability (i.e. P(p) > 75%). By contrast, the univariate model based only on fP4m (Model 2) produced the largest number of MFs classified as pregnant with high probability, with 34 individuals out of a total of 48 assigned a pregnancy probability greater than 75%. This result probably reflects the high misclassification and false positive rate associated with this model (table 3). The univariate model using just W50 (Model 3), and the best overall performance (table 3)

Discussion
Our analysis of a 6-year-long dataset of faecal hormone metabolites, drone-based photogrammetry and sightings revealed the strengths of drone-based body morphology and weaknesses of fP4m (using the specific assay antibody used here) for non-invasive pregnancy diagnosis in PCFG gray whales. The use of the W50 metric in the univariate mixture model (Model 3) successfully separated PF and non-PF females, while the high variability of fP4m limited its application to identify pregnancy accurately (Models 1 and 2). When comparing model performance validated by individuals of known reproductive status (table 3), the univariate model that used only W50 (Model 3) resulted in high accuracy (92%) with a low misclassification rate (8%). The multivariate approach using both W50 and fP4m (Model 1) is comparatively less accurate than Model 3, while the univariate approach using only fP4m (Model 2) resulted in the lowest accuracy with the highest false positive rate. Hence, it is clear that the lack of precision associated with the fP4m variable negatively influenced the performance of the multivariate approach.
Lemos et al. [35] used drone-based photogrammetry of PCFG gray whales to measure and compare body condition between individuals, using the body area index (BAI, [33]) metric, and found that PFs (n = 3) were the demographic class with the highest BAI. Our finding that PFs are significantly wider than non-pregnant females aligns with these initial results presented in Lemos et al. [22,35]. Despite a low sample size of confirmed PF (n = 5), the body width at 50% of TL (W50) satisfactorily discriminated pregnant from non-pregnant females, and Model 3 provided a useful analytical approach to assign pregnancy probability. In addition to the five confirmed pregnant females, we identified eight MF of unknown reproductive status and two JF with high pregnancy probability (greater than 75%) using Model 3. Although the two JF would traditionally be classified as sexually immature, observational data from the western North Pacific gray whale population indicate that the maturity age for gray whales could be as young as 5 years [61]; hence, it is possible that these two individual JF cases corresponded to true pregnancies. Thus, the use of W50 in Model 3 allows us to provisionally increase the total number of PF in this 6-year study, which might imply higher pregnancy rates than estimated by calf sighting only.
Among the PF confirmed with calf resighting, two were observed in 2016, and one each year in 2017, 2018 and 2019 (table 2). Of the eight putative MF pregnancies identified by Model 3, most occurred in 2019 (50%; n = 4 of 8), followed by 2021 (25%; n = 1 of 4), 2016 (22%; n = 2 of 9) and 2018 (11%; n = 1 of 11). Out of these eight MF, four had both morphometric and hormone data. Among these four individuals, two were also classified with a high probability of pregnancy by Models 1 and 2 (table 2). It is possible that these whales were truly pregnant but lost their calf before resighting, or that the pregnancies did not reach full term (see conclusion). In 2017 and 2020, no MF was classified as pregnant by this model. Current sample sizes of MF are too small to detect any patterns in the annual variability in pregnancy rates; however, other baleen whale studies [37,54,67,68] noted that the proportion of pregnant females correlates with larger oceanographic fluctuations that influence prey availability. Continued long-term research programmes with targeted sampling towards the end of the season can improve sample size and allow increased exploration of temporal patterns in pregnancy rates and correlations with environmental conditions. Interestingly, the ENP reproductive rates estimated based on calf production [69] show declines corresponding to the current UME (2019-present) [70] and a previous UME (1999-2000), with declines in the estimated abundance also occurring during these periods and in 2007-2010 [47,69]. Application of Model 3 with aerial morphometric data (W50) collected from ENP gray whales during the southbound migration would provide an opportunity to assess pregnancy rates and estimate calf loss once data are compared with calf count data collected during the northbound migration. Such derived data could improve population models and evaluation of drivers of UMEs.
Our inability to reliably use fP4m to diagnose pregnancy in this study is probably a consequence of (i) the timing of faecal collection with respect to the gestation period, which could explain the high overlap in the fP4m range between the PF and presumed non-pregnant groups, (ii) faecal consistency of this species and/or sample collection method, (iii) our low sample size, and/or (iv) the specific assay antibody used. The high overlap in fP4m levels between reproductive groups of female PCFG gray whales could be attributed to the timing of sampling, which falls between the first six to eight months of pregnancy when the fP4m concentrations may still be low, as gestation in gray whale females lasts approximately 13 months [7,48] and progesterone levels typically increase steadily throughout gestation [71]. While Lemos et al. [22] found slightly but statistically significantly elevated levels of fP4m in pregnant PCFG females (n = 4) as compared with the other demographic groups, studies of fP4m in other cetacean species have documented orders of magnitude higher levels in PFs as compared with non-pregnant groups (e.g. [27-30]). In addition, the consistency of gray whale faecal samples may impose constraints on the utility of fP4m in this species. PCFG gray whales typically produce faeces that consist of fine, unbound particles that rapidly disperse in the water column, forming a fast-sinking 'faecal plume' that poses challenges to the recovery of an adequate amount of sample for representative hormone quantifications, as faecal steroid metabolites are probably unevenly distributed in the faeces [23]. Hence our opportunistic sample collection process may introduce variability based on whether we collect a hormone 'hot spot' or not [23].
Evidence for this possibility is the relatively high coefficient of variation between samples collected from the same individual on the same day, which ranged from 0.36 to 65.
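For readers less familiar with the metric, the coefficient of variation (CV) of replicate samples is the sample standard deviation divided by the mean. The sketch below shows the calculation for same-day samples from one individual. The function name and the concentration values are hypothetical and are not data from this study. The function returns the CV as a fraction, and multiplying by 100 gives a percentage.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean.

    Requires at least two values and a non-zero mean.
    """
    if len(values) < 2:
        raise ValueError("at least two replicate values are required")
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("the mean is zero, so the CV is undefined")
    return statistics.stdev(values) / mean


if __name__ == "__main__":
    # Hypothetical fP4m concentrations from one whale on one day.
    same_day_samples = [42.0, 95.0, 61.0]
    cv = coefficient_of_variation(same_day_samples)
    print(f"CV = {cv:.2f} ({cv * 100:.0f}%)")
```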
Although our sample size of PF was low (n = 5), it is comparable to previous studies that successfully applied fP4m to distinguish pregnancy in baleen whales (4-year study of North Atlantic right whales produced 3 PF [27]; 13-year study of North Atlantic right whales produced 14 PF [28]; 2-year study of humpback whales produced 4 PF [29]). In the North Atlantic right whale and humpback whale studies, fP4m analysis provided much clearer separation of pregnant from non-pregnant females than we document here for gray whales. These marked differences in the utility of fP4m data for pregnancy diagnosis in different studies may be due to species-specific differences in progesterone metabolization in the gut, and/or to the different assay antibodies used in these various studies (prior studies on other large whales used an antibody that is no longer commercially available). Progesterone in terrestrial mammals is metabolized in the gut to up to 18 different faecal metabolites [18,72], with progesterone itself often no longer detectable at all, and the proportion and identities of the metabolites are highly species-specific and, sometimes, even population-specific (varying, for example, with diet, digestive enzymes and gut microbiota) [73]. The major progesterone faecal metabolites have not been identified for any species of large whale, because the necessary validations (infusions of radiolabelled progesterone, or 'challenge' experiments with infusions of hypothalamic and pituitary hormones) are not logistically feasible. Further, any given progesterone antibody typically cross-reacts with only some faecal metabolites of progesterone, such that different antibodies, even if originally raised against the same parent hormone (progesterone), can produce quite divergent data from faecal samples of the same species.
Small faecal masses in this study prevented the comparison of multiple antibodies, but other commercial antibodies for faecal progesterone metabolites do now exist and could be tested. Thus, we suggest that other antibodies and potentially other hormone quantification methods (e.g. liquid chromatography with tandem mass spectrometry) also be explored for this species, as it is possible that another fP4m quantification method might yield improved data for pregnancy diagnosis in gray whales. The unique characteristics of our study system, i.e. high individual resighting rates, over 30 years of sighting history, and the non-invasive nature of the faecal hormone approach that allows us to obtain multiple samples across and within seasons, provide an advantage over other study systems for continued development of this technique, including testing alternative hormones, specific antibodies, or alternative determination and quantification techniques.

Conclusion
All species of baleen whales were heavily depleted by commercial whaling during the past several centuries, and today are exposed to multiple anthropogenic stressors (e.g. entanglement in fishing gear, vessel strikes, shipping noise and boat interactions; [74]). These stressors may cause direct mortality, but more frequently they lead to indirect sublethal effects on individuals, such as long-term changes in health and reproduction that can ultimately result in impacts at the population level [75]. Moreover, detecting changes in a population's reproductive trend might indicate wider shifts in the marine ecosystem. For example, reproductive failure in large whales has been linked to changing environmental conditions [76], declines in prey availability [68,77,78], entanglements in fishing gear [79-82] and naturally occurring toxins [83]. Fecundity estimates for large whales are usually based on calf sightings, but such estimates have long been suspected to be underestimates of actual pregnancy rate, since some pregnancy loss and calf mortality presumably can occur before calves are sighted. Thus, calf sightings data may underestimate the reproductive capacity of the population, and may also underestimate the impact of potentially important natural and anthropogenic stressors, especially any that may disproportionately affect pregnant females or young calves. Therefore, researchers have attempted to diagnose pregnancy using several approaches [66], including quantification of hormone concentrations in faeces [27-29], blubber [8,9,11,84] and respiratory vapour [85]. Of these sample types, faecal analysis has the benefit of being completely non-invasive, with minimal disturbance to the animal during sample collection, and is now widely employed for studies of stress and reproductive physiology in terrestrial and aquatic wildlife [20,86].
However, the correct interpretation of faecal hormone data can be complex and requires careful validation, both analytically and biologically, when applied to a new species or with a new quantification technique [5,15,29]. Our assessment of the utility of faecal hormone techniques and drone-based photogrammetry for determining pregnancies in the PCFG gray whales highlights the need for further testing and validation of faecal hormone methods.
Interestingly, the univariate mixture model using only a morphometric measurement, length-standardized body width at 50% of the body length (W50), proved to be reliable in determining pregnancies in this population. Even with a small sample size of confirmed pregnancies, we were able to apply these methods to accurately classify PFs. Thus, given that drones are becoming increasingly common in whale research programmes, we encourage research teams to evaluate this morphometric approach to diagnose pregnancy in other whale species and in other gray whale populations, e.g. the ENP or the endangered western North Pacific gray whale population. As demonstrated in our study, this non-invasive approach to pregnancy identification has the potential to improve our ability to monitor variation in important, yet challenging to estimate, baleen whale population metrics of pregnancy and calf loss rates.
Ethics. This work did not require ethical approval from a human subject or animal welfare committee.
Data accessibility. The processed datasets generated for this study and relevant analysis code are available on the Figshare Digital Repository https://doi.org/10.6084/m9.figshare.22573231 [87]. The data are provided in electronic supplementary material [88].